Abstract


What are the functionals of the reward that can be computed and optimized exactly
in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic
Programming (DP) can only handle these operations efficiently for certain
classes of statistics. We summarize the characterization of these classes for policy
evaluation, and give a new answer for the planning problem. Interestingly, we prove
that only generalized means can be optimized exactly, even in the more general
framework of Distributional Reinforcement Learning (DistRL). DistRL does,
however, make it possible to evaluate other functionals approximately. We provide
error bounds on the resulting estimators, and discuss the potential of this approach
as well as its limitations. These results contribute to advancing the theory of Markov
Decision Processes by examining overall characteristics of the return, and particularly
risk-conscious strategies.


1 Introduction 


Reinforcement Learning (RL) has emerged as a flourishing field of study, delivering significant
practical applications ranging from robot control and game solving to drug discovery and hardware
design [Lazic et al., 2018, Popova et al., 2018, Volk et al., 2023, Mirhoseini et al., 2020]. The
cornerstone of RL is the "return", a sum of successive rewards. Conventionally, the focus is on
computing and optimizing its expected value in a Markov Decision Process (MDP). The remarkable
efficiency of MDPs comes from the fact that they can be solved through dynamic programming with the
Bellman equations [Sutton and Barto, 2018, Szepesvari, 2010]. RL theory has seen considerable
expansion, with renewed interest in descriptions of a policy's behavior that are richer than the
average return alone. At the other end of the spectrum, the so-called Distributional Reinforcement
Learning (DistRL) approach aims at studying and optimizing the entire return distribution, leading
to impressive practical results [Bellemare et al., 2017, Hessel et al., 2018, Wurman et al., 2022,
Fawzi et al., 2022]. Between the expectation and the entire distribution, the efficient handling of
other statistical functionals of the reward also appears particularly relevant for risk-sensitive
contexts [Bernhard et al., 2019, Mowbray et al., 2022].


Despite recent progress, the abilities and limitations of DistRL for computing other functionals
are not yet fully understood. Historically, the theory of RL has been established for discounted
MDPs; see e.g. [Sutton and Barto, 2018, Watkins and Dayan, 1992, Szepesvari, 2010, Bellemare et al.,
2023] for modern reference textbooks. Recently, more attention has been drawn to the undiscounted,
finite-horizon setting [Auer, 2002, Osband et al., 2013, Jin et al., 2018, Ghavamzadeh et al., 2020],
for which fundamental questions remain open. In this paper, we explore policy evaluation, planning
and exact learning algorithms for undiscounted MDPs for the optimization problem of general
functionals of the reward. We explicitly delimit the possibilities offered by dynamic programming
as well as DistRL.


37th Conference on Neural Information Processing Systems (NeurIPS 2023).


Our paper specifically addresses two questions: 


(i) How accurately can we evaluate statistical functionals by using DistRL? 


(ii) Which functionals can be exactly optimized through dynamic programming? 


We first recall the fundamental results in dynamic programming and Distributional RL. Addressing 
question (i), we refer to Rowland et al. [2019]’s results on Bellman closedness and provide their 
adaptation to undiscounted MDPs. We then prove upper bounds on the approximation error of Policy 
Evaluation with DistRL and corroborate these bounds with practical experiments. For question 
(ii), we draw a connection between Bellman closedness and planning. We then utilize the DistRL 
framework to identify two key properties held by optimizable functionals. 


Our main contribution is a characterization of the families of utilities that verify these two properties
(Theorem 2). This result gives a comprehensive answer to question (ii) and closes an important open
issue in the theory of MDPs. It shows in particular that DistRL does not extend the class of functionals
for which planning is possible beyond what is already allowed by classical dynamic programming.


2 Background 


We introduce the classical RL framework in finite-horizon tabular Markov Decision Processes
(MDPs). We write $\Delta(\mathbb{R})$ for the space of probability distributions on $\mathbb{R}$. A
finite-horizon tabular MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, p, R, H)$, where
$\mathcal{X}$ is a finite state space, $\mathcal{A}$ is a finite action space, $H$ is the horizon, and
for each $h \in [H]$, $p_h(x, a, \cdot)$ is a transition probability law and $R_h(x, a)$ is a reward
random variable with distribution $\varrho_h$. The parameters $(p_h)$ and $(R_h)$ define the model of
dynamics. A deterministic policy on $\mathcal{M}$ is a sequence $\pi = (\pi_1, \ldots, \pi_H)$ of
functions $\pi_h : \mathcal{X} \to \mathcal{A}$.


Reinforcement Learning traditionally focuses on learning policies optimizing the expected return.
For a given policy $\pi$, the $Q$-function maps a state-action pair to its expected return under $\pi$:


$$Q_h^\pi(x, a) = \mathbb{E}_{\varrho_h}[R_h(x, a)] + \sum_{x'} p_h(x, a, x')\, Q_{h+1}^\pi(x', \pi_{h+1}(x')), \qquad Q_{H+1}^\pi(x, a) = 0. \tag{1}$$


When the model is known, the $Q$-function of a policy $\pi$ can be computed by a backward
recursion, also called dynamic programming. This is referred to as Policy Evaluation. Similarly, an
optimal policy can be found by solving the optimal Bellman equation:


$$Q_h^*(x, a) = \mathbb{E}_{\varrho_h}[R_h(x, a)] + \sum_{x'} p_h(x, a, x') \max_{a'} Q_{h+1}^*(x', a'), \qquad Q_{H+1}^*(x, a) = 0. \tag{2}$$


Solving this equation when the model is known is also called Planning. When it is unknown, 
reinforcement learning aims at finding the optimal policy from sample runs of the MDP. But evaluating 
and optimizing the expectation of the return in the definition of the Q-function above is just one 
choice of statistical functional. We now introduce Distributional RL and then discuss other statistical 
functionals that generalize the expected setting discussed so far. 
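
As an illustration of the backward recursion behind Eq. (2), the following is a minimal sketch in plain Python. The function name `plan_expected_return`, the nested-list layout of the model and the toy two-state example are assumptions made for this illustration only; steps are 0-indexed in the code.

```python
def plan_expected_return(p, r_mean, H):
    """Backward induction for the optimal Bellman equation (Eq. (2)).

    Hypothetical layout: p[h][x][a][y] is the probability of moving from x to y
    under action a at step h, and r_mean[h][x][a] is the mean reward.
    Steps are 0-indexed here (h = 0..H-1 stands for h = 1..H in the text).
    """
    n_states, n_actions = len(p[0]), len(p[0][0])
    q_next = [[0.0] * n_actions for _ in range(n_states)]  # Q*_{H+1} = 0
    q, policy = [None] * H, [None] * H
    for h in reversed(range(H)):
        v_next = [max(row) for row in q_next]  # max_{a'} Q*_{h+1}(x', a')
        q[h] = [[r_mean[h][x][a]
                 + sum(p[h][x][a][y] * v_next[y] for y in range(n_states))
                 for a in range(n_actions)]
                for x in range(n_states)]
        # Greedy action in each state; ties go to the smallest action index.
        policy[h] = [max(range(n_actions), key=q[h][x].__getitem__)
                     for x in range(n_states)]
        q_next = q[h]
    return q, policy


# Toy model: action 0 leads to state 0, action 1 leads to state 1.
# Only state 1 pays: reward 1 for action 0, reward 2 for action 1.
p_step = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r_step = [[0.0, 0.0], [1.0, 2.0]]
q, policy = plan_expected_return([p_step] * 2, [r_step] * 2, H=2)
```

In this toy model the greedy first-step action is to move to (or stay in) state 1 from both states, since only state 1 yields rewards.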


2.1 Distributional RL 


Distributional RL (DistRL) refers to the approach that tracks not just a statistic of the return for each 
state but its entire distribution. We introduce here the most important basic concepts and refer the 
reader to the recent comprehensive survey by Bellemare et al. [2023] for more details. The main idea 
is to use the full distributions to estimate and optimize various metrics over the returns ranging from 
the mere expectation [Bellemare et al., 2017] to more complex metrics [Rowland et al., 2019, Dabney 
et al., 2018a, Liang and Luo, 2022]. 


At state-action $(x, a)$, let $Z_h^\pi(x, a)$ denote the future sum of rewards when following policy
$\pi$ and starting at step $h$, also called return. It verifies the simple recursive formula
$Z_h^\pi(x, a) = R_h(x, a) + Z_{h+1}^\pi(X', \pi_{h+1}(X'))$ where $X' \sim p_h(x, a, \cdot)$. Its
distribution is $\eta^\pi = (\eta_h^\pi(x, a))_{(x,a,h) \in \mathcal{X} \times \mathcal{A} \times [H]}$
and is often referred to as the Q-value distribution. One can easily derive the recursive law of
the return as a convolution: for any two measures $\nu_1, \nu_2 \in \Delta(\mathbb{R})$, we denote their
convolution by $\nu_1 * \nu_2(t) = \int_{\mathbb{R}} \nu_1(\tau)\nu_2(t - \tau)\,d\tau$. For any two
independent random variables $X$ and $Y$, the distribution of the sum $Z = X + Y$ is the convolution of
their distributions: $\nu_Z = \nu_X * \nu_Y$. Thus, the law of $Z_h^\pi(x, a)$ is


$$\forall x, a, h, \quad \eta_h^\pi(x, a) = \varrho_h(x, a) * \sum_{x'} p_h(x, a, x')\, \eta_{h+1}^\pi(x', \pi_{h+1}(x')). \tag{3}$$


This equation is a distributional equivalent to Eq. (1) and thus defines a distributional Bellman
operator $\eta_h^\pi = \mathcal{T}_h^\pi \eta_{h+1}^\pi$.


Obviously, from a practical point of view, distributions form a non-parametric family that is not
computationally tractable. It is necessary to choose a parametric (thus incomplete) family to represent
them. Even the restriction to discrete reward distributions is not tractable, since the number of
atoms in the distributions may grow exponentially with the number of steps¹ [Achab and Neu, 2021]:
approximations are unavoidable. The most natural solution is to project the obtained
distribution onto the parametric family at each step of the Bellman operator. This process is called
parameterization. The practical equivalent to Eq. (1) in DistRL hence reads


$$\forall x, a, h, \quad \hat\eta_h^\pi(x, a) = \Pi\Big(\varrho_h(x, a) * \sum_{x'} p_h(x, a, x')\, \hat\eta_{h+1}^\pi(x', \pi_{h+1}(x'))\Big) \tag{4}$$

where $\Pi$ is the projection operator onto the parametric family. The full policy evaluation
algorithm in DistRL is summarized in Alg. 1.


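Algorithm 1 Policy Evaluation (Dynamic Programming) for Distributional RL


Input: model $p$, reward distributions $\varrho_h$, policy $\pi$ to be evaluated, projection $\Pi$.

Data: $\hat\eta \in \mathbb{R}^{H|\mathcal{X}||\mathcal{A}|N}$

$\forall (x, a) \in \mathcal{X} \times \mathcal{A}$, $\hat\eta_H(x, a) = \delta_0$

for $h = H - 1 \to 0$ do
    $\hat\eta_h(x, a) = \varrho_h(x, a) * \sum_{x'} p_h(x, a, x')\, \hat\eta_{h+1}(x', \pi_{h+1}(x'))$   $\forall (x, a) \in \mathcal{X} \times \mathcal{A}$
    $\hat\eta_h(x, a) = \Pi(\hat\eta_h(x, a))$   $\forall (x, a) \in \mathcal{X} \times \mathcal{A}$

end for

Output: $\hat\eta_h(x, a)$ $\forall x, a, h$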


Distribution Parametrization The most commonly used parametrization is the so-called quantile
projection. It puts Diracs (atoms) with fixed weights at locations that correspond to the quantiles of
the source distribution. One main benefit is that it does not require prior knowledge of the support
of the distribution, and allows for unbounded distributions.


The quantile projection is defined as


$$\Pi_Q \mu = \frac{1}{N} \sum_{i=0}^{N-1} \delta_{z_i}, \quad \text{with } (z_i)_i \text{ chosen such that } F_\mu(z_i) = \frac{2i+1}{2N}, \tag{5}$$

which corresponds to a minimal $W_1$ distance:
$\Pi_Q \mu \in \arg\min_{\hat\mu = \frac{1}{N}\sum_i \delta_{z_i}} W_1(\mu, \hat\mu)$, where
$W_1(\cdot, \cdot)$ is the Wasserstein distance defined for any distributions $\nu_1, \nu_2$ as
$W_1(\nu_1, \nu_2) = \int_0^1 |F_{\nu_1}^{-1}(u) - F_{\nu_2}^{-1}(u)|\,du$. Note that this
parametrization might admit several solutions and thus the projection may not be unique. For
simplicity, we overload the notation to
$\Pi_Q \eta = (\Pi_Q \eta(x, a))_{(x,a) \in \mathcal{X} \times \mathcal{A}}$.


¹ The support of the return (sum of the rewards) is incremented at each step by a number of atoms
that depends on the current support.


For a Q-value distribution $\eta$ with support of length $\Delta_\eta$, and parametrization of
resolution $N$, Rowland et al. [2019] prove that the projection error is bounded by


$$\sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} W_1(\Pi_Q \eta(x, a), \eta(x, a)) \leq \frac{\Delta_\eta}{2N}. \tag{6}$$
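
To make Eqs. (5) and (6) concrete, here is a minimal sketch for finite distributions stored as lists of atoms and probabilities. The helper names `quantile_projection` and `wasserstein1` and the uniform five-atom example are assumptions for illustration; the quantile is taken as the generalized inverse of the CDF, which is one of the admissible choices when the projection is not unique.

```python
def quantile_projection(atoms, probs, n):
    """Locations z_i = F^{-1}((2i+1)/(2n)) of the n equal-weight atoms (Eq. (5))."""
    pairs = sorted(zip(atoms, probs))
    out, k, cdf = [], 0, pairs[0][1]
    for i in range(n):
        level = (2 * i + 1) / (2 * n)
        # Advance to the smallest atom whose CDF reaches the level
        # (small tolerance against floating-point round-off).
        while cdf < level - 1e-12 and k < len(pairs) - 1:
            k += 1
            cdf += pairs[k][1]
        out.append(pairs[k][0])
    return out


def wasserstein1(atoms_a, probs_a, atoms_b, probs_b):
    """W1 distance, computed as the integral of |F_a - F_b| (equal to the
    quantile-function form in one dimension)."""
    mass_a, mass_b = {}, {}
    for z, w in zip(atoms_a, probs_a):
        mass_a[z] = mass_a.get(z, 0.0) + w
    for z, w in zip(atoms_b, probs_b):
        mass_b[z] = mass_b.get(z, 0.0) + w
    points = sorted(set(mass_a) | set(mass_b))
    total, fa, fb = 0.0, 0.0, 0.0
    for left, right in zip(points, points[1:]):
        fa += mass_a.get(left, 0.0)
        fb += mass_b.get(left, 0.0)
        total += abs(fa - fb) * (right - left)
    return total


atoms, probs, n = [0, 1, 2, 3, 4], [0.2] * 5, 2
z = quantile_projection(atoms, probs, n)          # two atoms of weight 1/2
error = wasserstein1(atoms, probs, z, [1 / n] * n)
bound = (max(atoms) - min(atoms)) / (2 * n)       # right-hand side of Eq. (6)
```

For this example the projection keeps the atoms 1 and 3, and the projection error stays below the bound of Eq. (6).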


In Section 3, we extend this result to the full iterative Policy Evaluation process and bound the error on
the returned statistical functional in the finite-horizon setting. Note that other studied parametrizations
exist but are less practical. For completeness, we discuss the Categorical Projection [Bellemare et al.,
2017, Rowland et al., 2018, Bellemare et al., 2023] in Appendix B.


2.2 Beyond expected returns 


The expected value is an important functional of a probability distribution, but it is not the only one 
of interest in decision theory — especially when a control of the risk is important. We discuss two 
concepts that have received considerable attention: utilities, defined as expected values of functions 
of the return, and distorted means which place emphasis on certain quantiles. 


Expected Utilities are of the form $\mathbb{E}[f(Z)]$, or $\int f\,d\nu$, where $Z$ is the return of
distribution $\nu$ and $f$ is an increasing function. For instance, when $f$ is a power function, we
obtain the different moments of the return. The case of exponential functions plays a particularly
important role: the resulting utility is referred to as exponential utility, exponential risk measure,
or generalized mean according to the context:


$$U_{\exp}(\nu) = \frac{1}{\lambda} \log \mathbb{E}[\exp(\lambda X)], \qquad X \sim \nu \text{ and } \lambda \in \mathbb{R}. \tag{7}$$


This family of utilities has a variety of applications in finance, economics, and decision making
under uncertainty [Follmer and Schied, 2016]. They can be considered as a risk-aware generalization
of the expectation, with benefits such as accommodating a wide range of behaviors [Shen et al.,
2014] from risk-seeking when $\lambda > 0$, to risk-averse when $\lambda < 0$ (the limit
$\lambda \to 0$ is exactly the expectation). To fix ideas,
$U_{\exp}(\mathcal{N}(\mu, \sigma^2)) = \mu + \frac{\lambda}{2}\sigma^2$: each $\lambda$ captures a
certain quantile of the Gaussian distribution.
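
The behavior of the exponential utility of Eq. (7) can be checked numerically on a finite distribution. The sketch below is illustrative only: the function name `exponential_utility` and the two-atom example are assumptions, and the log-sum-exp shift is a standard numerical precaution rather than part of the definition.

```python
import math


def exponential_utility(atoms, probs, lam):
    """U_exp(nu) = (1/lam) log E[exp(lam X)]; returns the mean when lam == 0."""
    if lam == 0:
        return sum(w * z for z, w in zip(atoms, probs))
    # Shift by the largest exponent (log-sum-exp) for numerical stability.
    m = max(lam * z for z in atoms)
    s = sum(w * math.exp(lam * z - m) for z, w in zip(atoms, probs))
    return (m + math.log(s)) / lam


atoms, probs = [0.0, 10.0], [0.5, 0.5]                  # mean is 5
risk_averse = exponential_utility(atoms, probs, -1.0)   # pulled towards 0
risk_seeking = exponential_utility(atoms, probs, 1.0)   # pulled towards 10
near_mean = exponential_utility(atoms, probs, 1e-6)     # close to the mean
```

Negative values of $\lambda$ give a value below the mean, positive values a value above it, and small $|\lambda|$ recovers the expectation.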


Distorted means, on the other hand, involve taking the mean of a random variable, but with a
different weighting scheme [Dabney et al., 2018a]. The goal is to place more emphasis on certain
quantiles, which can be achieved by considering the quantile function $F^{-1}$ of the random variable
and a continuous increasing function $\beta : [0, 1] \to [0, 1]$. By applying $\beta$ to a uniform
variable $\tau$ on $[0, 1]$ and evaluating $F^{-1}$ at the resulting value $\beta(\tau)$, we obtain a
new random variable that takes the same values as the original variable, but with different
probabilities. The distorted mean is then calculated as the mean of this new random variable, given by
the formula $\int_0^1 \beta'(\tau) F^{-1}(\tau)\,d\tau$. If $\beta$ is the identity function, the result
is the classical mean. When $\beta$ is $\tau \mapsto \min(\tau/\alpha, 1)$, we get the
$\alpha$-Conditional Value at Risk (CVaR($\alpha$)) of the return, a risk measure widely used in risk
evaluation [Rockafellar et al., 2000].
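
For a finite distribution the quantile function is piecewise constant, so the distorted mean reduces to a weighted sum of atoms. The following sketch uses hypothetical helper names (`distorted_mean`, `cvar_distortion`) and a uniform four-atom example; with the distortion above, CVaR($\alpha$) is the mean of the lower $\alpha$-tail.

```python
def distorted_mean(atoms, probs, beta):
    """Integral of beta'(tau) F^{-1}(tau) d tau for a finite distribution.

    F^{-1} is piecewise constant, so the integral is the sum over sorted atoms
    of atom * (beta(F_k) - beta(F_{k-1})), with F_k the CDF at the k-th atom.
    """
    total, cdf = 0.0, 0.0
    for z, w in sorted(zip(atoms, probs)):
        total += z * (beta(min(cdf + w, 1.0)) - beta(cdf))
        cdf += w
    return total


def cvar_distortion(alpha):
    """Distortion tau -> min(tau / alpha, 1) yielding CVaR(alpha)."""
    return lambda tau: min(tau / alpha, 1.0)


atoms, probs = [0.0, 1.0, 2.0, 3.0], [0.25] * 4
mean = distorted_mean(atoms, probs, lambda tau: tau)        # identity distortion
cvar_half = distorted_mean(atoms, probs, cvar_distortion(0.5))
```

Here the identity distortion gives the usual mean, while CVaR(0.5) averages the two smallest atoms.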


3 Policy Evaluation 


The theory of MDPs is particularly developed for estimating and optimizing the mean of the return
of a policy. But other values associated with the return can be computed the same way, by dynamic
programming. This includes for instance the variance of the return, or more generally, any moment
of order $p \geq 2$, as was already noticed in the 1980s [Sobel, 1982]. Recently, Rowland et al. [2019]
showed that for utilities in discounted MDPs, this is essentially all that can be done. More precisely,
they introduce the notion of Bellman closedness (recalled below for completeness) that characterizes
a finite set of statistics that can efficiently be computed by dynamic programming.


Definition 1 (Bellman closedness [Rowland et al., 2019]). A set of statistical functionals
$\{s_1, \ldots, s_K\}$ is said to be Bellman closed if for each
$(x, a) \in \mathcal{X} \times \mathcal{A}$, the statistics $s_{1:K}(\eta_h^\pi(x, a))$ can be
expressed in closed form in terms of the random variables $R_h(x, a)$ and
$s_{1:K}(\eta_{h+1}^\pi(X', A'))$, $A' \sim \pi(X')$, $X' \sim p_h(x, a, \cdot)$, independently of the
MDP.


Importantly, in the undiscounted setting, Rowland et al. [2019] (Appendix B, Theorem 4.3) show that
the only families of utilities that are Bellman closed are of the form
$\{x \mapsto x^\ell \exp(\lambda x) \mid 0 \leq \ell \leq L\}$ for some $L < \infty$. Thus, all
utilities and statistics of the form of (or linear combinations of) moments and exponential utilities
can easily be computed by classic linear dynamic programming and do not require distributional RL
(see Appendix A.3).
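
As a small example of a Bellman-closed family, the first two moments of the return can be propagated jointly by a linear recursion, which gives the variance without any distributional machinery. The sketch below assumes a stationary model under a fixed policy and rewards independent of the next state; the function name `first_two_moments` and the single-state Bernoulli example are illustrative assumptions.

```python
def first_two_moments(p, r_m1, r_m2, H):
    """Joint DP on (E[Z], E[Z^2]) for a fixed policy in a stationary model.

    p[x][y] is the transition probability under the policy, r_m1[x] = E[R] and
    r_m2[x] = E[R^2] in state x; rewards are independent of the next state.
    """
    n = len(p)
    m1, m2 = [0.0] * n, [0.0] * n  # moments of the empty return after step H
    for _ in range(H):
        exp_m1 = [sum(p[x][y] * m1[y] for y in range(n)) for x in range(n)]
        exp_m2 = [sum(p[x][y] * m2[y] for y in range(n)) for x in range(n)]
        # E[(R + Z')^2] = E[R^2] + 2 E[R] E[Z'] + E[Z'^2]
        m1, m2 = ([r_m1[x] + exp_m1[x] for x in range(n)],
                  [r_m2[x] + 2 * r_m1[x] * exp_m1[x] + exp_m2[x] for x in range(n)])
    return m1, m2


# Single state with Bernoulli(0.5) rewards: the return is Binomial(10, 0.5).
m1, m2 = first_two_moments([[1.0]], [0.5], [0.5], H=10)
variance = m2[0] - m1[0] ** 2
```

The recursion needs both moments at the next step to update the second moment, which is why the family, and not the second moment alone, is Bellman closed.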


Some important metrics such as the CVaR or the quantiles are not known to belong to any Bellman- 
closed set and hence cannot be easily computed. For this kind of function of the return, the knowledge 
of the transitions and the values in following steps is insufficient to compute the value on a specific 
step. In general, it requires the knowledge of the whole distribution of each reward in each state. 
Hence, techniques developed in distributional RL come in handy: for a choice of parametrization, 
one can use the projected dynamic programming step Eq. (4) to propagate a finite set of values along 
the MDP and approximate the distribution of the return. In the episodic setting, following the line of 
Rowland et al. [2019] (see Eq.(6)), we prove that the Wasserstein distance error between the exact 
and approximate distribution of the Q-values of a policy is bounded. 


Proposition 1. Let $\pi$ be a policy and $\eta^\pi$ the associated Q-value distributions. Assume the
return is bounded on an interval of length $\Delta_\eta \leq H\Delta_R$, where $\Delta_R$ is the
support size of the reward distribution. Let $\hat\eta^\pi$ be the Q-value distributions obtained by
dynamic programming (Algorithm 1) using the quantile projection $\Pi_Q$ with resolution $N$. Then,


$$\sup_{(x,a,h) \in \mathcal{X} \times \mathcal{A} \times [H]} W_1(\hat\eta_h^\pi(x, a), \eta_h^\pi(x, a)) \leq H \frac{\Delta_\eta}{2N} \leq H^2 \frac{\Delta_R}{2N}.$$


This result shows that the loss of information due to the parametrization grows at most quadratically
with the horizon. The proof consists of summing the projection bound in (6) at each projection step,
and using the non-expansion property of the Bellman operator [Bellemare et al., 2017]. The details
can be found in Appendix C.


The key question is then to understand how such error translates into our estimation problem when 
we apply the function of interest to the approximate distribution. We provide a first bound on this 
error for the family of statistics that are either utilities or distorted means. 


First, we prove that these functionals are Lipschitz on the set of return distributions.


Lemma 1. Let $s$ be either a utility or a distorted mean and let $L$ be the Lipschitz coefficient of its
characteristic function. Let $\nu_1, \nu_2$ be return distributions. Then:


$$|s(\nu_1) - s(\nu_2)| \leq L\, W_1(\nu_1, \nu_2).$$


The two families of functionals are treated separately, but yield a similar bound. The utility bound
is a direct application of the Kantorovich-Rubinstein duality, while the distorted mean bound follows
from directly bounding the integrand. Again, the details are provided in the Appendix.


This property allows us to prove an upper bound on the estimation error for those two families.


Theorem 1. Let $\pi$ be a policy. Let $\eta^\pi$ be the Q-value return distribution associated to
$\pi$ with the return bounded on an interval of length $\Delta_\eta \leq H\Delta_R$, where $\Delta_R$
is the support size of the reward distribution. Let $\hat\eta^\pi$ be the approximated return
distribution computed with Algorithm 1, for the projection $\Pi_Q$ with resolution $N$. Let $s$ be
either an expected utility or a distorted mean, and $L$ the Lipschitz coefficient of its characteristic
function. Then:


$$\sup_{x,a,h} |s(\hat\eta_h^\pi(x, a)) - s(\eta_h^\pi(x, a))| \leq L H \frac{\Delta_\eta}{2N} \leq L H^2 \frac{\Delta_R}{2N}.$$


Note that depending on the choice of utilities, the Lipschitz coefficient $L$ may also depend on $H$
and $\Delta_R$. For instance, in a stationary MDP, the Lipschitz constant of the exponential utility
depends exponentially on $\Delta_\eta$. For the CVaR($\alpha$), however, $L$ is constant and only
depends on $\alpha \in (0, 1)$.


Experiment: empirical validation of the bounds on a simple MDP We consider a simple Chain
MDP environment of length $H = 70$ equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019],
with a single action leading to the same discrete reward distribution for every step. We consider a
Bernoulli reward distribution $\mathcal{B}(0.5)$ for each state so that the number of atoms for the
return only grows linearly² with the number of steps, which allows the exact distribution to be
computed easily.


² At round $h \in [H]$, the support of the return is $\{0, 1, \ldots, h\}$, so $h + 1$ atoms.


[Figure 1 diagram: a chain of states, each yielding a reward $R \sim \varrho$, with
$\forall h \in [H]$, $p(h + 1 \mid h, a = 1) = 1$.]


Figure 1: A Chain MDP of length H with deterministic transition and identical reward distribution
for each state.

Figure 2: Left: Validation of Theorem 1 on CVaR($\alpha$) together with the scaled upper bound (see
main text for discussion): the quadratic dependence in H is verified. Right: Validation of
Proposition 1: The cumulative projection error (dashed blue) is the sum of the projection errors at
every time step, and matches the true approximation error (solid blue). The theoretical upper bound
(dashed red) differs only by a factor 2.


We compare the distributions obtained with exact dynamic programming and the approximate
distribution obtained by Alg. 1, with a quantile projection with resolution $N = 1000$. Note that even
at early stages, when the true distribution has fewer atoms than the resolution, the exact and
approximate distributions differ due to the weights of the atoms in the quantile projection. Figure 2
(Right) reports the Wasserstein distance between the two distributions: the cumulative projection
approximation error (dashed blue), the true error between the current exact and approximate
distributions (solid blue) and the theoretical bound (red). Fundamentally, the proof of Prop. 1 upper
bounds the distance between distributions by the cumulative projection error, so we plot this quantity
to help validate it.


We also empirically validate Theorem 1 by computing the CVaR($\alpha$) for
$\alpha \in \{0.1, 0.25\}$, corresponding respectively to distorted means with Lipschitz constants
$L \in \{10, 4\}$. We compute these statistics for both distributions and report the maximal error
together with the theoretical bound, re-scaled³ by a factor 5. Figure 2 (Left) shows that the
empirical error follows the quadratic trend predicted by the theory, up to a constant multiplicative
gap.
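
A reduced version of this chain experiment can be sketched in a few lines, under simplifying assumptions that are ours and not the authors' code: the horizon and resolution are much smaller ($H = 20$, $N = 50$), the approximate distribution is stored as $N$ equal-weight integer atoms, and the quantile projection exploits the fact that convolving with $\mathcal{B}(0.5)$ yields $2N$ equal-weight atoms. Since all steps share the same reward law, the forward accumulation used here has the same distribution as the backward recursion of Alg. 1.

```python
import math
from bisect import bisect_right


def chain_experiment(H, N):
    """W1 error between the exact Binomial(h, 0.5) return and its projected
    approximation after each of the H steps (hypothetical helper)."""
    approx = [0] * N                      # delta_0 stored as N equal-weight atoms
    errors = []
    for h in range(1, H + 1):
        # Convolution with Bernoulli(0.5): every atom z splits into z and z + 1.
        merged = sorted(approx + [z + 1 for z in approx])
        # Quantile projection: the level (2i+1)/(2N) of 2N equal-weight atoms
        # is reached at the (2i+1)-th smallest atom, i.e. merged[2i].
        approx = merged[::2]
        # W1 as the integral of |F_exact - F_approx| over unit-length intervals.
        cdf_exact, w1 = 0.0, 0.0
        for k in range(h):
            cdf_exact += math.comb(h, k) / 2 ** h
            cdf_approx = bisect_right(approx, k) / N
            w1 += abs(cdf_exact - cdf_approx)
        errors.append(w1)
    return errors


H, N = 20, 50
errors = chain_experiment(H, N)
bound = H * H / (2 * N)                   # H * Delta_eta / (2N) with Delta_eta = H
```

The first step is represented exactly, the error appears from the second step on, and every entry of `errors` stays below the bound of Proposition 1.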


4 Planning 


Planning refers to the problem of returning a policy that optimizes our objective for a given model
and reward function (or distribution in DistRL). Like policy evaluation, it is grounded on a Bellman
equation: see Eq. (2) for the classical expected return, which leads to efficient computations by
dynamic programming.


For other statistical functionals of the cumulated reward, however, can the optimal policy be computed 
efficiently? The main result of this section characterizes the family of functionals that can be exactly 
and efficiently optimized by Dynamic Programming. 


³ Scaling by a constant factor allows us to show the corresponding quadratic trends.


In the previous section, we recalled that Bellman-closed sets of utilities can be efficiently computed
by DP as long as all the values of the utilities in the Bellman-closed family are computed together. For
the planning problem, however, we only want to optimize one utility, so we cannot consider families
as previously. Under this constraint, only exponential and linear expected utilities are Bellman closed
and thus can verify a Bellman Equation. In fact, for the exponential utilities, such a Bellman Equation
exists and allows the planning problem to be solved efficiently [Howard and Matheson, 1972]:


$$Q_h^\lambda(x, a) = U_{\exp}^\lambda(R_h(x, a)) + \frac{1}{\lambda} \log\Big[\sum_{x'} p_h(x, a, x') \exp\big(\lambda \max_{a'} Q_{h+1}^\lambda(x', a')\big)\Big] \tag{8}$$


with $Q_{H+1}^\lambda(x, a) = 0$.
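
The recursion of Eq. (8) can be sketched as follows for a stationary model with finite reward distributions. The names `u_exp` and `plan_exponential`, the data layout and the one-state example (a safe reward of 1 against a risky reward of 0 or 2.2) are assumptions chosen for illustration.

```python
import math


def u_exp(atoms, probs, lam):
    """Exponential utility (1/lam) log E[exp(lam R)] of a finite distribution."""
    return math.log(sum(w * math.exp(lam * z) for z, w in zip(atoms, probs))) / lam


def plan_exponential(p, rewards, H, lam):
    """Backward recursion of Eq. (8) for a stationary model (lam != 0).

    p[x][a][y] are transition probabilities and rewards[x][a] = (atoms, probs).
    Returns the first-step Q-values and the greedy policy for each step.
    """
    n, m = len(p), len(p[0])
    q = [[0.0] * m for _ in range(n)]     # Q_{H+1} = 0
    policy = []
    for _ in range(H):
        v = [max(row) for row in q]       # max_{a'} Q_{h+1}(x', a')
        q = [[u_exp(*rewards[x][a], lam)
              + math.log(sum(p[x][a][y] * math.exp(lam * v[y]) for y in range(n))) / lam
              for a in range(m)] for x in range(n)]
        policy.insert(0, [max(range(m), key=q[x].__getitem__) for x in range(n)])
    return q, policy


# One state, two actions: a sure reward of 1, or 0 / 2.2 with equal probability.
p = [[[1.0], [1.0]]]
rewards = [[([1.0], [1.0]), ([0.0, 2.2], [0.5, 0.5])]]
q_averse, policy_averse = plan_exponential(p, rewards, H=2, lam=-1.0)
q_seeking, policy_seeking = plan_exponential(p, rewards, H=2, lam=1.0)
```

Although the risky action has the larger mean (1.1), the risk-averse planner ($\lambda = -1$) selects the safe action at both steps, while the risk-seeking planner ($\lambda = 1$) selects the risky one.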


However, the question of the existence of Optimal Bellman Equations for non-utility functionals
remains open (e.g. quantiles). More generally, an efficient planning strategy is not known. To
address these questions, we consider the most general framework, DistRL, and recall the theoretical
DP equations for any statistical functional $s$ in the Pseudo-Algorithm 2⁴. DistRL offers the most
comprehensive, or 'lossless', approach, so if a statistical functional cannot be optimized with Alg. 2,
then there cannot exist a Bellman Operator to perform exact planning.


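Algorithm 2 Pseudo-Algorithm: Exact Planning with Distributional RL
1: Input: model $p$, reward $R$, statistical functional $s$
2: Data: $\eta \in \mathbb{R}^{H|\mathcal{X}||\mathcal{A}|N}$, $\nu \in \mathbb{R}^{H|\mathcal{X}|N}$
3: $\forall x \in \mathcal{X}$, $\nu_{H+1}^x = \delta_0$
4: for $h = H \to 1$ do
5:     $\eta_h(x, a) = \varrho_h(x, a) * \sum_{x'} p_h(x, a, x')\, \nu_{h+1}^{x'}$   $\forall (x, a) \in \mathcal{X} \times \mathcal{A}$
6:     $\nu_h^x = \eta_h(x, a^*)$, $a^* \in \arg\max_a s(\eta_h(x, a))$   $\forall x \in \mathcal{X}$
7: end for
8: Output: $\eta_h(x, a)$ $\forall x, a, h$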


We formalize this idea with the new concept of Bellman Optimizable statistical functionals: 


Definition 2 (Bellman Optimizable statistical functional). A statistical functional $s$ is called
Bellman optimizable if, for any MDP $\mathcal{M}$, the Pseudo-Algorithm 2 outputs an optimal return
distribution $\eta = \eta^{\pi^*}$ that verifies:


$$\forall x, a, h, \quad s(\eta_h(x, a)) = \sup_\pi s(\eta_h^\pi(x, a)). \tag{9}$$

Remark. This definition is equivalent to the satisfaction of an Optimal Distributional Bellman
equation. Indeed, a statistical functional $s$ is Bellman Optimizable if and only if, for any
$(\mathcal{X}, \mathcal{A}, \varrho, p, \eta)$, $s$ verifies


$$\sup_{(a_{x'})_{x'}} s\Big(\varrho * \sum_{x'} p(x')\, \eta(x', a_{x'})\Big) = s\Big(\varrho * \sum_{x'} p(x')\, \eta(x', a^*_{x'})\Big)$$

with $a^*_x \in \arg\max_a s(\eta(x, a))$.


We can now state our main result, which characterizes all the Bellman optimizable statistical
functionals. First, we prove that such a functional must satisfy two important properties.


Lemma 2. A Bellman optimizable functional $s$ satisfies the two following properties:
• Independence Property: If $\nu_1, \nu_2 \in \Delta(\mathbb{R})$ are such that
$s(\nu_1) \geq s(\nu_2)$, then
$\forall \nu_3 \in \Delta(\mathbb{R}), \forall \lambda \in [0, 1], \quad s(\lambda\nu_1 + (1 - \lambda)\nu_3) \geq s(\lambda\nu_2 + (1 - \lambda)\nu_3)$.
• Translation Property: Let $\tau_c$ denote the translation on the set of distributions:
$\tau_c \delta_x = \delta_{x+c}$. If $\nu_1, \nu_2 \in \Delta(\mathbb{R})$ are such that
$s(\nu_1) \geq s(\nu_2)$, then
$\forall c \in \mathbb{R}, \quad s(\tau_c \nu_1) \geq s(\tau_c \nu_2)$.
Indeed, the expectation and the exponential utility both satisfy these properties (see Appendix A.2).
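
The Independence property can be probed numerically. The sketch below, with hypothetical helpers `mix`, `median` and `u_exp` and distributions stored as `{atom: probability}` dictionaries, exhibits one triple $(\nu_1, \nu_2, \nu_3)$ for which the median violates the property while the exponential utility preserves the ordering on the same triple. It is a single counterexample chosen for illustration, not a proof.

```python
import math


def mix(lam, nu_a, nu_b):
    """lam * nu_a + (1 - lam) * nu_b for distributions stored as {atom: prob}."""
    out = {}
    for weight, nu in ((lam, nu_a), (1 - lam, nu_b)):
        for z, w in nu.items():
            out[z] = out.get(z, 0.0) + weight * w
    return out


def median(nu):
    """Lower median: smallest atom z with F(z) >= 1/2."""
    cdf = 0.0
    for z in sorted(nu):
        cdf += nu[z]
        if cdf >= 0.5:
            return z


def u_exp(nu, lam):
    """Exponential utility (1/lam) log E[exp(lam X)]."""
    return math.log(sum(w * math.exp(lam * z) for z, w in nu.items())) / lam


nu1, nu2, nu3 = {1: 1.0}, {0: 0.75, 10: 0.25}, {20: 1.0}
mix1, mix2 = mix(0.5, nu1, nu3), mix(0.5, nu2, nu3)

# The median prefers nu1 to nu2, but the preference flips after mixing with nu3.
median_violates = median(nu1) >= median(nu2) and median(mix1) < median(mix2)
# The exponential utility keeps the same ordering before and after mixing.
u_exp_consistent = u_exp(nu1, -1.0) >= u_exp(nu2, -1.0) and u_exp(mix1, -1.0) >= u_exp(mix2, -1.0)
```

The dyadic weights are chosen so that all CDF values are exact in floating point, which keeps the median comparison free of round-off issues.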


Each property is implied by an aspect of the Distributional Bellman Equation (Alg. 2, line 5)
and the proof (in Appendix E) unveils these important consequences of the recursion identities.
Fundamentally, they follow from the Markovian nature of policies optimized this way: the choice of
the action in each state should be independent of other states and rely only on the knowledge of the
next-state value distribution.


⁴ This theoretical algorithm handles the full distribution of the return at each step, which cannot be
done in practice.


The Independence property states that, for Bellman optimizable functionals, the value of each next 
state should not depend on that of any other value in the convex combination in the rightmost term of 
the convolution. In turn, the Translation property is associated to the leftmost term, the reward, and it 
imposes that, for Bellman optimizable functionals, the decision on the best action is independent of 
the previously accumulated reward. 


The Independence property is related to expected utilities [von Neumann et al., 1944]. Any expected
utility verifies this property (Appendix A.2) but, most importantly, the Expected Utility Theorem (also
known as the von Neumann-Morgenstern theorem) implies that any continuous statistical functional
$s$ verifying the Independence property can be reduced to an expected utility. This means that for
any such statistical functional $s$, there exists a continuous $f$ such that, writing
$U_f(\nu) = \int f\,d\nu$, for all $\nu_1, \nu_2 \in \Delta(\mathbb{R})$ we have
$s(\nu_1) \geq s(\nu_2) \iff U_f(\nu_1) \geq U_f(\nu_2)$ [von Neumann et al., 1944, Grandmont, 1972].


This result directly narrows down the family of Bellman optimizable functionals to utilities. Indeed,
although other functionals might potentially be optimized using the Bellman equation, addressing
the problem on utilities is adequate to characterize all possible behaviors. For instance,
moment-optimal policies that can be found through dynamic programming can also be found by optimizing
an exponential utility function. The next task is therefore to identify all the utilities that satisfy
the second property. We demonstrate that, apart from the mean and exponential utilities, no other
$W_1$-continuous functional satisfies this property.


Theorem 2. Let $\varrho$ be a return distribution. The only $W_1$-continuous Bellman Optimizable
statistical functionals of the cumulated return are exponential utilities
$U_{\exp}(\varrho) = \frac{1}{\lambda} \log \mathbb{E}_\varrho[\exp(\lambda R)]$ for
$\lambda \in \mathbb{R}$, with the special case of the expectation $\mathbb{E}_\varrho[R]$ when
$\lambda = 0$.


Of course, if $U(\varrho)$ is a utility and $\psi$ is a monotone scalar mapping, $\psi(U(\varrho))$ is
an equivalent utility: one should understand in the previous theorem that a $W_1$-continuous Bellman
Optimizable statistical functional is equivalent to $U_{\exp}(\varrho)$ for some
$\lambda \in \mathbb{R}$. We chose here to define
$U_{\exp}(\varrho) = \frac{1}{\lambda} \log \mathbb{E}_\varrho[\exp(\lambda R)]$ with the log and the
factor $1/\lambda$ since it results in a normalized utility that tends to the expectation when
$\lambda$ goes to 0. The full proof of Theorem 2 is provided in Appendix E.


We make a few important observations. First, this result shows that algorithms using Bellman updates
to optimize any continuous functionals other than the exponential utility cannot guarantee optimality.
The theorem does not apply to non-continuous functionals, but Lemma 2 still does. For instance, the
quantiles are not $W_1$-continuous so Theorem 2 does not apply, but it is easy to prove that they do
not verify the Independence Property and thus are not Bellman Optimizable. There might also exist
other optimizable functionals, like moments, but they must first be reduced to exponential or linear
utilities.


Most importantly, while in theory DistRL provides the most general framework for optimizing
policies via dynamic programming, our result shows that, in fact, the only utilities that can be exactly
and efficiently optimized do not require resorting to DistRL. This certainly does not question the
very purpose of DistRL, which has been shown to play important roles in practice to regularize or
stabilize policies and to perform deeper exploration [Bellemare et al., 2017, Hessel et al., 2018].
Some advantages of learning the distribution lie in the enhanced robustness offered by the richer
information learned [Rowland et al., 2023], particularly when utilizing neural networks for function
approximation [Dabney et al., 2018b, Barth-Maron et al., 2018, Lyle et al., 2019].


5 Q-Learning Exponential Utilities 


The previous sections assume that the model, i.e. the reward and transition functions, is known.
Yet in most practical situations, those are either approximated or learnt⁵. After addressing policy
evaluation (Section 3) and planning (Section 4), we conclude the argument of this paper by
addressing the question of learning the statistical functionals of Theorem 2. In fact, we simply
highlight a lesser-known version of Q-Learning [Watkins and Dayan, 1992] that extends this celebrated
algorithm to exponential utilities. We provide the pseudo-code for the algorithm proposed by Borkar
[2002, 2010] with the relevant utility-based updates in Alg. 3. We refer to these seminal works
for convergence proofs. Linear utility updates (line 8) differ only slightly from the classical ones
for expected return optimization, which have been shown to lead to the optimal value asymptotically
[Watkins and Dayan, 1992].


⁵ Either explicitly (model-based RL) or implicitly (model-free RL, considered here).


Algorithm 3 Q-Learning for Linear and Exponential Utilities


1: Input: $(\alpha_k)_{k \in \mathbb{N}}$, transition and reward generator. $Q_h(x, a) \leftarrow H$, $\forall (x, a, h) \in \mathcal{X} \times \mathcal{A} \times [H]$
2: Utilities: Linear ($Z \mapsto \lambda Z + b$) or Exponential ($Z \mapsto \log(\mathbb{E}\exp(\lambda Z))/\lambda$)
3: for episode $k = 1, \ldots, K$ do
4:     Observe $x_1 \in \mathcal{X}$
5:     for step $h = 1, \ldots, H$ do
6:         Choose action $a_h \in \arg\max_{a \in \mathcal{A}} Q_h(x_h, a)$
7:         Observe reward $r_h$ and transition $x_{h+1}$ and update for chosen objective:
8:         Linear Util.: $Q_h(x_h, a_h) \leftarrow (1 - \alpha_k) Q_h(x_h, a_h) + \alpha_k[\lambda(r_h + \max_{a'} Q_{h+1}(x_{h+1}, a')) + b]$
9:         Exponential Util.: $Q_h(x_h, a_h) \leftarrow \frac{1}{\lambda} \log\big[(1 - \alpha_k) e^{\lambda Q_h(x_h, a_h)} + \alpha_k e^{\lambda(r_h + \max_{a'} Q_{h+1}(x_{h+1}, a'))}\big]$
10:     end for
11: end for
12: Output: $Q_h(x, a)$ $\forall x, a$
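
The exponential update of Alg. 3 (line 9) can be isolated as a single function. The sketch below is an illustration under assumptions of ours: the name `exp_q_update`, the step size $\alpha_k = 1/k$ and the deterministic alternating reward stream are chosen so that the outcome can be checked by hand, and are not taken from Borkar's analysis. With a single state, a single action and $H = 1$, the update then reduces to a running average of $e^{\lambda r}$.

```python
import math


def exp_q_update(q_old, target, alpha, lam):
    """One exponential-utility update (Alg. 3, line 9):
    Q <- (1/lam) log[(1 - alpha) e^{lam Q} + alpha e^{lam target}],
    where target = r_h + max_{a'} Q_{h+1}(x_{h+1}, a')."""
    return math.log((1 - alpha) * math.exp(lam * q_old)
                    + alpha * math.exp(lam * target)) / lam


lam, q = -1.0, 1.0                        # risk-averse; Q initialised to H = 1
for k, reward in enumerate([0.0, 2.0] * 50, start=1):
    # Horizon 1: the target is the reward alone, since Q_{H+1} = 0.
    q = exp_q_update(q, reward, alpha=1.0 / k, lam=lam)

# Exponential utility of a reward equal to 0 or 2 with probability 1/2 each.
expected = math.log(0.5 + 0.5 * math.exp(2 * lam)) / lam
```

After the 100 updates, `q` coincides (up to round-off) with the exponential utility of the empirical reward distribution, which is what the update is designed to track.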


6 Discussions and Related Work 


The Discounted Framework We focused in this article on undiscounted MDPs, and it is important
to note that the results differ for discounted scenarios. The crucial difference is that the family of
exponential utilities no longer retains the Bellman Closed or Bellman Optimizable properties due to
the introduction of the discount factor $\gamma$ [Rowland et al., 2019]. When it comes to Bellman
Optimization, the necessary translation property becomes an affine property:
$\forall c, \gamma$, $s(\tau_c^\gamma \nu_1) \geq s(\tau_c^\gamma \nu_2)$, where $\tau_c^\gamma$
is the affine operator such that $\tau_c^\gamma \delta_x = \delta_{\gamma x + c}$. This property is not
upheld by the exponential utility. Nonetheless, there exists a method to optimize the exponential
utility through dynamic programming in discounted MDPs [Chung and Sobel, 1987]. This approach requires
modifying the functional to optimize at each step (the step $h$ is optimized with the utility
$x \mapsto \exp(\gamma^{-h} \lambda x)$), but it also implies a loss of policy stationarity, a property
usually obtained in dynamic programming for discounted finite-horizon MDPs [Sutton and Barto, 2018].
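
The failure of the affine property for the exponential utility can be seen on a two-distribution example. The helpers `u_exp` and `affine`, the distributions and the values $\gamma = 0.1$, $c = 1$ below are assumptions picked for illustration: rescaling the return by $\gamma$ acts like replacing $\lambda$ by $\gamma\lambda$, which can reverse a preference.

```python
import math


def u_exp(nu, lam):
    """Exponential utility (1/lam) log E[exp(lam X)], distributions as {atom: prob}."""
    return math.log(sum(w * math.exp(lam * z) for z, w in nu.items())) / lam


def affine(nu, gamma, c):
    """Push-forward of nu by the affine map x -> gamma * x + c."""
    out = {}
    for z, w in nu.items():
        out[gamma * z + c] = out.get(gamma * z + c, 0.0) + w
    return out


lam = 1.0                                   # risk-seeking utility
nu1, nu2 = {0.0: 0.5, 2.2: 0.5}, {1.5: 1.0}
prefers_nu1 = u_exp(nu1, lam) >= u_exp(nu2, lam)

# After discounting by gamma = 0.1 (and shifting by c = 1) the order is reversed.
d1, d2 = affine(nu1, 0.1, 1.0), affine(nu2, 0.1, 1.0)
order_reversed = u_exp(d1, lam) < u_exp(d2, lam)
```

With $\gamma = 1$ the map is a pure translation and the ordering is preserved, in line with the Translation property of Lemma 2.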


Utilizing functionals to optimize expected return. DistRL has also been used in Deep Reinforcement
Learning to optimize non-Bellman-optimizable functionals such as distorted means [Ma et al.,
2020, Dabney et al., 2018a]. While, as we proved, such algorithms cannot lead to optimal policies
in terms of these functionals, experiments show that in some contexts they can lead to better expected
return and faster convergence in practice. The change of functional can be interpreted as a change in
the exploration process, and the resulting risk-sensitive behaviors seem to be relevant in adequate
environments.


Dynamic programming for the optimization of other functionals. To optimize other statistical
functionals such as CVaR and other utilities such as moments with Dynamic Programming, Bauerle
and Ott [2011] and Bauerle and Rieder [2014] propose to extend the state space of the original MDP
to $\mathcal{X}' = \mathcal{X} \times \mathbb{R}$ by theoretically adding a continuous dimension to store the current cumulative
rewards. This idea does not contradict our results, and the resulting algorithms remain empirically
much more expensive.


Another recent thread of ideas to optimize functionals of the reward revolves around the dual formulation
of RL through the empirical state distribution [Hazan et al., 2019]. Algorithms can be derived
by noticing that optimizing utilities like the CVaR is equivalent to solving a convex RL problem [Mutti et al.,
2023].


7 Conclusion 


Our work closes an important open problem in the theory of MDPs: we exactly characterize the 
families of statistical functionals that can be evaluated and optimized by dynamic programming. 


We also put into perspective the DistRL framework: the only functionals of the return that can be 
optimized with DistRL can actually be handled exactly by dynamic programming. Its benefit lies 
elsewhere, and notably in the improved stability of behavioral properties it allows. We believe that,
by narrowing down the avenues to explain its empirical successes, our work can help clarify
which further research should be conducted on the theory of DistRL.


A Additional remarks


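The Wasserstein metric is defined as $W_1(\nu_1, \nu_2) = \int_0^1 |F^{-1}_{\nu_1}(u) - F^{-1}_{\nu_2}(u)|\,du$ and the
Cramér metric as $\ell_2(\nu_1, \nu_2) = \left(\int_{\mathbb{R}} |F_{\nu_1}(u) - F_{\nu_2}(u)|^2\,du\right)^{1/2}$. For both metrics, we define
their supremum $\bar{\ell}_2(\eta_1, \eta_2) = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} \ell_2(\eta_1(x,a), \eta_2(x,a))$ and $\bar{W}_1(\eta_1, \eta_2) =
\sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} W_1(\eta_1(x,a), \eta_2(x,a))$.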
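
As an illustration, the two metrics can be computed exactly for finitely supported distributions. The sketch below is not taken from the paper: the representation of a distribution as a list of `(atom, probability)` pairs and the function names are assumptions made for the example. It uses the CDF form of $W_1$, which coincides with the quantile form for distributions on the real line.

```python
from math import sqrt

def cdf(dist, t):
    """Right-continuous CDF at t of a discrete distribution [(atom, prob), ...]."""
    return sum(p for z, p in dist if z <= t)

def wasserstein_and_cramer(nu1, nu2):
    """Return (W_1, l_2) between two finitely supported distributions."""
    # Both CDFs are constant between consecutive points of the merged support,
    # so the integrals reduce to finite sums over these intervals.
    grid = sorted({z for z, _ in nu1} | {z for z, _ in nu2})
    w1, l2_squared = 0.0, 0.0
    for left, right in zip(grid, grid[1:]):
        gap = cdf(nu1, left) - cdf(nu2, left)
        w1 += abs(gap) * (right - left)
        l2_squared += gap ** 2 * (right - left)
    return w1, sqrt(l2_squared)

# Example: a Dirac at 0 against a fair coin on {0, 1}.
w1_example, l2_example = wasserstein_and_cramer([(0.0, 1.0)], [(0.0, 0.5), (1.0, 0.5)])
```

The supremum versions over state-action pairs would simply take the maximum of these quantities over all $(x, a)$.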


A.1 Remarks on the recursive definition of the Q-value distribution 


The notation of the Q-value distribution $\eta^\pi$ is often deceivingly complex compared to the actual object
it is meant to represent. While the 'usual' expected Q-function $Q(x, a)$ is simply understood as the
expected return of a policy at state-action pair $(x, a)$, DistRL requires us to keep a notation for
the complete distribution of the return. In other words, the Q-value distribution $\eta^\pi_h(x, a)$ should be
understood as the distribution of the random variable $Z^\pi_h(x, a) = R_h(x, a) + Z^\pi_{h+1}(X', A')$, which is the convolution of
the individual distributions of these two independent random variables. It can also be written:

$$\forall x, a, \quad \eta^\pi_h(x, a) = \sum_{x', a'} p_h(x, a, x')\, \pi_{h+1}(a' \mid x')\, \big(\eta^\pi_{h+1}(x', a') * \varrho_h(x, a)\big),$$

where $\varrho_h(x, a)$ denotes the distribution of the reward $R_h(x, a)$ and $*$ the convolution.


A.2 Linear and Exponential Utilities satisfy the properties of Lemma 2 


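Independence Property. Any utility $U_f$ verifies the independence property. Let $\nu_1, \nu_2, \nu_3 \in
\mathcal{P}(\mathbb{R})$, $\lambda \in [0, 1]$. Assume $U_f(\nu_1) \ge U_f(\nu_2)$. Then,

$$
\begin{aligned}
U_f(\lambda\nu_1 + (1 - \lambda)\nu_3) &= \int f \, d(\lambda\nu_1 + (1 - \lambda)\nu_3) \\
&= \lambda \underbrace{\int f \, d\nu_1}_{U_f(\nu_1)} + (1 - \lambda) \int f \, d\nu_3 \\
&\ge \lambda \underbrace{\int f \, d\nu_2}_{U_f(\nu_2)} + (1 - \lambda) \int f \, d\nu_3 \\
&= \int f \, d(\lambda\nu_2 + (1 - \lambda)\nu_3) \\
&= U_f(\lambda\nu_2 + (1 - \lambda)\nu_3).
\end{aligned}
$$

In particular, the mean and the exponential utility do.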
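
Translation Property. This property comes from the linearity of the mean and the multiplicative
morphism of the exponential. Let $\nu_1, \nu_2 \in \mathcal{P}(\mathbb{R})$, $c \in \mathbb{R}$. Assume that $U_{\exp}(\nu_1) \ge U_{\exp}(\nu_2)$ and
$U_{\text{mean}}(\nu_1) \ge U_{\text{mean}}(\nu_2)$. Then,

$$
\begin{aligned}
U_{\exp}(\tau_c \nu_1) &= \int \exp(r) \, d(\tau_c \nu_1)(r) \\
&= \int \exp(r + c) \, d\nu_1(r) \\
&= \exp(c) \int \exp(r) \, d\nu_1(r) \\
&= \exp(c)\, U_{\exp}(\nu_1) \\
&\ge \exp(c)\, U_{\exp}(\nu_2) \\
&= U_{\exp}(\tau_c \nu_2),
\end{aligned}
$$

and

$$
\begin{aligned}
U_{\text{mean}}(\tau_c \nu_1) &= \int (r + c) \, d\nu_1(r) \\
&= c + U_{\text{mean}}(\nu_1) \\
&\ge c + U_{\text{mean}}(\nu_2) \\
&= U_{\text{mean}}(\tau_c \nu_2).
\end{aligned}
$$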
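
The independence property can be checked numerically on small examples. The following sketch is illustrative only: the three Dirac distributions, the mixing weight and the mean-minus-variance functional (which is not a utility) are hypothetical choices, not taken from the paper. It shows that the mean and the exponential utility keep the ranking after mixing, while the mean-minus-variance functional does not.

```python
from math import exp

def mixture(lam, d1, d2):
    """lam * d1 + (1 - lam) * d2 for distributions given as [(atom, prob), ...]."""
    return [(z, lam * p) for z, p in d1] + [(z, (1 - lam) * p) for z, p in d2]

def mean(dist):
    return sum(p * z for z, p in dist)

def exp_utility(dist):
    return sum(p * exp(z) for z, p in dist)

def mean_minus_variance(dist):
    # Hypothetical functional that is NOT an expected utility.
    m = mean(dist)
    return m - sum(p * (z - m) ** 2 for z, p in dist)

def independence_holds(s, nu1, nu2, nu3, lam):
    """Check s(nu1) >= s(nu2)  =>  s(lam nu1 + (1-lam) nu3) >= s(lam nu2 + (1-lam) nu3)."""
    if s(nu1) < s(nu2):
        nu1, nu2 = nu2, nu1  # order the pair so that the premise holds
    return s(mixture(lam, nu1, nu3)) >= s(mixture(lam, nu2, nu3))

nu1, nu2, nu3, lam = [(0.0, 1.0)], [(-1.0, 1.0)], [(-2.0, 1.0)], 0.5
results = {s.__name__: independence_holds(s, nu1, nu2, nu3, lam)
           for s in (mean, exp_utility, mean_minus_variance)}
```

On this instance, mixing with $\nu_3$ reverses the ranking of $\nu_1$ and $\nu_2$ for the mean-minus-variance functional, because the mixture with $\nu_1$ has a larger variance.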
A.3 Policy Evaluation for linear combinations of moments 


Rowland et al. [2019] prove a necessary condition on Bellman-closed utilities, namely that they
should be a family of the form $\{x \mapsto x^\ell \exp(\lambda x) \mid 0 \le \ell \le L\}$ for the undiscounted, finite-horizon
setting. In the discounted case, the necessary condition is only valid for $\lambda = 0$, that is, without
the exponential. They also prove that moments (without the exponential) verify the sufficient
condition, so that in that setting they are the only Bellman-closed families of utilities.


In the undiscounted setting, to the best of our knowledge, a similar result has not yet been proved.
We provide here the sufficient condition for families of the form $\{x \mapsto x^\ell \exp(\lambda x) \mid 0 \le \ell \le L\}$. We
show that they are Bellman-closed and that this implies that they can be computed by DP.


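Let us consider the family $s_k(\nu) = \int r^k \exp(\lambda r) \, d\nu(r)$ for $k \in [n]$ and some fixed $\lambda \in \mathbb{R}$.

$$
\begin{aligned}
s_n(\eta^\pi_h(x, a))
&= \mathbb{E}\left[Z^\pi_h(x, a)^n \exp(\lambda Z^\pi_h(x, a))\right] \\
&= \mathbb{E}\left[(R_h(x, a) + Z^\pi_{h+1}(X', A'))^n \exp\big(\lambda(R_h(x, a) + Z^\pi_{h+1}(X', A'))\big)\right] \\
&= \mathbb{E}_{x', a'}\left[\mathbb{E}_{R_h, Z_{h+1}}\left[(R_h(x, a) + Z_{h+1}(x', a'))^n \exp\big(\lambda(R_h(x, a) + Z_{h+1}(x', a'))\big) \,\middle|\, X' = x', A' = a'\right]\right] \\
&= \sum_{x', a'} p_h(x, a, x')\, \pi_{h+1}(a' \mid x')\, \mathbb{E}_{R_h, Z_{h+1}}\left[(R_h(x, a) + Z_{h+1}(x', a'))^n \exp\big(\lambda(R_h(x, a) + Z_{h+1}(x', a'))\big)\right] \\
&= \sum_{x', a'} p_h(x, a, x')\, \pi_{h+1}(a' \mid x')\, \mathbb{E}_{R_h, Z_{h+1}}\left[\sum_{k=0}^{n} \binom{n}{k} R_h(x, a)^{n-k} \exp(\lambda R_h(x, a))\, Z_{h+1}(x', a')^k \exp(\lambda Z_{h+1}(x', a'))\right] \\
&= \sum_{x', a'} p_h(x, a, x')\, \pi_{h+1}(a' \mid x') \sum_{k=0}^{n} \binom{n}{k} \mathbb{E}_{R_h}\left[R_h(x, a)^{n-k} \exp(\lambda R_h(x, a))\right] \mathbb{E}_{Z_{h+1}}\left[Z_{h+1}(x', a')^k \exp(\lambda Z_{h+1}(x', a'))\right] \\
&= \sum_{x', a'} p_h(x, a, x')\, \pi_{h+1}(a' \mid x') \sum_{k=0}^{n} \binom{n}{k} \mathbb{E}_{R_h}\left[R_h(x, a)^{n-k} \exp(\lambda R_h(x, a))\right] s_k(\eta^\pi_{h+1}(x', a')).
\end{aligned}
$$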


This first proves that this family of statistical functionals is Bellman closed: each functional at step $h$ can be
expressed as a linear combination of the functionals of the family at step $h + 1$. Moreover, on the right-hand side, the expression only depends on
the distributions and functionals at the current step $h$ and at the next step $h + 1$. Thus, it provides a
natural way to evaluate these functionals by DP.
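
The recursion above can be turned into a short backward induction. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the two-state MDP, the policy and all names (`P`, `R`, `pi`, `evaluate_statistics`) are hypothetical, the model is taken time-homogeneous for brevity, and the return after the last step is assumed to be 0. The statistics obtained by DP are compared with those computed from the exact return distribution.

```python
from math import comb, exp

# Hypothetical two-state MDP, not taken from the paper.
P = {0: {"a": [(0, 0.5), (1, 0.5)], "b": [(1, 1.0)]},        # P[x][a] = [(x', prob)]
     1: {"a": [(0, 1.0)], "b": [(0, 0.3), (1, 0.7)]}}
R = {0: {"a": [(0.0, 0.5), (1.0, 0.5)], "b": [(0.5, 1.0)]},  # R[x][a] = [(reward, prob)]
     1: {"a": [(1.0, 1.0)], "b": [(0.0, 0.2), (2.0, 0.8)]}}
pi = {0: {"a": 0.4, "b": 0.6}, 1: {"a": 1.0}}                # pi[x] = {a: prob}

def reward_stat(dist, j, lam):
    """E[R^j exp(lam R)] for a discrete reward distribution."""
    return sum(p * r ** j * exp(lam * r) for r, p in dist)

def evaluate_statistics(H, n, lam):
    """Backward DP on s_0..s_n using the binomial recursion."""
    # After the last step the return is 0: s_0 = 1 and s_k = 0 for k >= 1.
    s = {(x, a): [1.0] + [0.0] * n for x in P for a in P[x]}
    for _ in range(H):
        new = {}
        for x in P:
            for a in P[x]:
                new[(x, a)] = [
                    sum(p * q * comb(m, k) * reward_stat(R[x][a], m - k, lam) * s[(x2, a2)][k]
                        for x2, p in P[x][a] for a2, q in pi[x2].items() for k in range(m + 1))
                    for m in range(n + 1)]
        s = new
    return s

def evaluate_distribution(H):
    """Exact return distributions {return: prob}, used only as a reference."""
    eta = {(x, a): {0.0: 1.0} for x in P for a in P[x]}
    for _ in range(H):
        new = {}
        for x in P:
            for a in P[x]:
                d = {}
                for x2, p in P[x][a]:
                    for a2, q in pi[x2].items():
                        for z, pz in eta[(x2, a2)].items():
                            for r, pr in R[x][a]:
                                d[r + z] = d.get(r + z, 0.0) + p * q * pz * pr
                new[(x, a)] = d
        eta = new
    return eta

H, n, lam = 4, 3, -0.5
s = evaluate_statistics(H, n, lam)
eta = evaluate_distribution(H)
```

The DP only stores $n + 1$ numbers per state-action pair, whereas the reference computation stores a distribution whose support grows with the horizon.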


B Categorical Projection: an alternative parametrization 


The categorical projection was proposed and studied in Bellemare et al. [2017], Rowland et al. [2018],
Bellemare et al. [2023]. For a bounded return distribution, it spreads a fixed number $N$ of Diracs
evenly over the support and uses weight parameters to represent the true distribution. The parameter
$N$ is often referred to as the resolution of the projection. More precisely, on a support $[V_{\min}, V_{\max}]$,
we write $\Delta = \frac{V_{\max} - V_{\min}}{N - 1}$ the step between atoms $z_i = V_{\min} + i\Delta$, $i \in [0, N - 1]$. We define the
projection of a given Dirac distribution $\delta_y$ on the parametric space $\mathcal{P}_C(\mathbb{R}) = \{\sum_i p_i \delta_{z_i} \mid 0 \le p_i \le 1, \sum_i p_i = 1\}$ by

$$
\Pi_C(\delta_y) =
\begin{cases}
\delta_{z_0} & y \le z_0, \\
\frac{z_{i+1} - y}{\Delta}\, \delta_{z_i} + \frac{y - z_i}{\Delta}\, \delta_{z_{i+1}} & z_i < y \le z_{i+1}, \\
\delta_{z_{N-1}} & y > z_{N-1}.
\end{cases}
$$

This definition can naturally be extended to any bounded distribution $\nu$, and by extension, $\Pi_C \eta =
(\Pi_C \eta(x, a))_{(x,a) \in \mathcal{X} \times \mathcal{A}}$. This linear operator maps a distribution to its closest point in
the parametric space with respect to the Cramér distance [Rowland et al., 2018].


This projection verifies the following approximation bound, analogous to that of the quantile projection:


A 
sup éa(IIon(x,a),n(a,a)) < ar (12) 
(x, a)EXXA 


Using the property that $W_1(\nu_1, \nu_2) \le \sqrt{\Delta_\eta}\, \ell_2(\nu_1, \nu_2)$, the results with the quantile projection can be
adapted for the categorical projection, adding $\sqrt{\Delta_\eta}$ factors to the bounds.

This parametrization has the nice property of preserving the mean of the distribution. Yet, even for
risk-neutral RL, quantile-based DistRL algorithms seem to work better and display better properties
[Dabney et al., 2018b,a].
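
To make the categorical projection concrete, here is a minimal sketch for a finitely supported distribution. The function name, the `(atom, probability)` representation and the example values are assumptions made for illustration. Each atom is split linearly between its two neighbouring grid points, and atoms outside the grid are clipped to the nearest end point.

```python
def categorical_projection(dist, v_min, v_max, n_atoms):
    """Project [(atom, prob), ...] onto n_atoms evenly spaced Diracs on [v_min, v_max]."""
    step = (v_max - v_min) / (n_atoms - 1)
    atoms = [v_min + i * step for i in range(n_atoms)]
    weights = [0.0] * n_atoms
    for y, p in dist:
        if y <= atoms[0]:
            weights[0] += p
        elif y >= atoms[-1]:
            weights[-1] += p
        else:
            # Index of the grid point just below y, then linear split of the mass.
            i = min(int((y - v_min) / step), n_atoms - 2)
            frac = (y - atoms[i]) / step
            weights[i] += p * (1.0 - frac)
            weights[i + 1] += p * frac
    return list(zip(atoms, weights))

def mean(dist):
    return sum(p * z for z, p in dist)

# Example with a support included in [0, 2] and a resolution of 5 atoms.
nu = [(0.3, 0.2), (1.7, 0.5), (2.0, 0.3)]
projected = categorical_projection(nu, 0.0, 2.0, 5)
```

Because the split is linear in $y$, the projection leaves the mean unchanged whenever the support of the distribution lies inside $[V_{\min}, V_{\max}]$, which the example above can be used to check.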


C Proofs for Policy Evaluation with parameterized distributions 


C.1 Proof of Proposition 1 


We recall the statement of Proposition 1: Let $\pi$ be a policy and $\eta^\pi$ the associated Q-value distributions.
Assume the return is bounded on an interval of length $\Delta_\eta \le H \Delta_R$, where $\Delta_R$ is the support size
of the reward distribution. Let $\hat{\eta}^\pi$ be the Q-value distributions obtained by dynamic programming
(Algorithm 1) using the quantile projection $\Pi_Q$ with resolution $N$. Then,

$$\sup_{(x, a, h) \in (\mathcal{X}, \mathcal{A}, [H])} W_1\big(\hat{\eta}^\pi_h(x, a), \eta^\pi_h(x, a)\big) \le H \frac{\Delta_\eta}{2N} \le H^2 \frac{\Delta_R}{2N}.$$

To avoid clutter of notation, we denote $W_1(\hat{\eta}, \eta) := \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} W_1(\hat{\eta}(x, a), \eta(x, a))$.


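Proof. First recall that for any Q-value distribution $(\eta_h)_{h \in [H]}$, with the return bounded on an interval
of length $\Delta_\eta \le H \Delta_R$, and $\Pi$ one of the projection operators of interest with resolution $N$, we have
the following bound on the projection estimation error due to Rowland et al. [2019] (Eq. (6)):

$$W_1(\Pi \eta, \eta) \le \frac{\Delta_\eta}{2N}. \quad (13)$$

At a fixed step $h \in [H]$, we have the following inequality:

$$
\begin{aligned}
W_1(\hat{\eta}^\pi_h, \eta^\pi_h) &= W_1(\Pi \mathcal{T}^\pi \hat{\eta}^\pi_{h+1}, \mathcal{T}^\pi \eta^\pi_{h+1}) \\
&\le W_1(\Pi \mathcal{T}^\pi \hat{\eta}^\pi_{h+1}, \mathcal{T}^\pi \hat{\eta}^\pi_{h+1}) + W_1(\mathcal{T}^\pi \hat{\eta}^\pi_{h+1}, \mathcal{T}^\pi \eta^\pi_{h+1}) \quad (14) \\
&\le H \frac{\Delta_R}{2N} + W_1(\hat{\eta}^\pi_{h+1}, \eta^\pi_{h+1}). \quad (15)
\end{aligned}
$$

Here (14) is due to the triangle inequality with $\mathcal{T}^\pi \hat{\eta}^\pi_{h+1}$ as the middle term. In (15), the first term
comes from applying (13) to the first term of the previous line, together with $\Delta_\eta \le H \Delta_R$. The second term is a consequence of
the non-expansive property of the Bellman operator [Bellemare et al., 2017]:

$$W_1(\mathcal{T}^\pi \eta_1, \mathcal{T}^\pi \eta_2) \le W_1(\eta_1, \eta_2).$$

Using it recursively starting from $h = 1$, and using the fact that $\hat{\eta}^\pi_H = \eta^\pi_H$, we get:

$$W_1(\hat{\eta}^\pi_1, \eta^\pi_1) \le H \frac{\Delta_R}{2N} + W_1(\hat{\eta}^\pi_2, \eta^\pi_2) \le 2H \frac{\Delta_R}{2N} + W_1(\hat{\eta}^\pi_3, \eta^\pi_3) \le \dots \le H^2 \frac{\Delta_R}{2N}.$$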
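
The one-step projection bound (13) can be observed on a small example. The sketch below is an illustration under stated assumptions rather than the paper's code: it uses the midpoint quantile projection, which places $N$ Diracs of mass $1/N$ at the quantiles of levels $(2i+1)/(2N)$, and the example distribution and function names are hypothetical.

```python
def cdf(dist, t):
    """Right-continuous CDF at t of a discrete distribution [(atom, prob), ...]."""
    return sum(p for z, p in dist if z <= t)

def w1(nu1, nu2):
    """1-Wasserstein distance via the area between the two CDFs."""
    grid = sorted({z for z, _ in nu1} | {z for z, _ in nu2})
    return sum(abs(cdf(nu1, left) - cdf(nu2, left)) * (right - left)
               for left, right in zip(grid, grid[1:]))

def quantile_projection(dist, n):
    """N Diracs of mass 1/N at the quantiles of levels (2i + 1) / (2N)."""
    support = sorted(dist)
    atoms = []
    for i in range(n):
        tau = (2 * i + 1) / (2 * n)
        acc = 0.0
        for z, p in support:
            acc += p
            if acc >= tau - 1e-12:  # small tolerance against rounding
                atoms.append(z)
                break
        else:
            atoms.append(support[-1][0])
    return [(z, 1.0 / n) for z in atoms]

nu = [(0.0, 0.1), (0.4, 0.3), (1.0, 0.35), (3.0, 0.25)]
n_atoms = 4
support_length = max(z for z, _ in nu) - min(z for z, _ in nu)
projection_error = w1(quantile_projection(nu, n_atoms), nu)
bound = support_length / (2 * n_atoms)
```

On this instance the projection error is well below the bound, which is consistent with the observation that the bound is only attained by specific distributions.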

C.2 Proof of Lemma 1


We recall the statement of Lemma 1: Let $s$ be either a utility or a distorted mean and let $L$ be the
Lipschitz coefficient of its characteristic function. Let $\nu_1, \nu_2$ be return distributions. Then:

$$|s(\nu_1) - s(\nu_2)| \le L\, W_1(\nu_1, \nu_2).$$


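Proof. We prove the property for each family of functionals separately:

Case 1: $s$ is a utility. There exists $f$ such that $s(\nu) = \int f \, d\nu$. Let $L_f$ be its Lipschitz constant. The
Kantorovich-Rubinstein duality [Villani, 2003] states that:

$$W_1(\nu_1, \nu_2) = \frac{1}{L_f} \sup_{\|g\|_L \le L_f} \left(\int g \, d\nu_1 - \int g \, d\nu_2\right), \quad (16)$$

where $\|\cdot\|_L$ is the Lipschitz norm. We then immediately get:

$$L_f\, W_1(\nu_1, \nu_2) \ge \left|\int f \, d\nu_1 - \int f \, d\nu_2\right| = |s(\nu_1) - s(\nu_2)|. \quad (17)$$

Case 2: $s$ is a distorted mean. There exists $g$ such that $s(\nu) = \int_0^1 g'(\tau) F^{-1}_\nu(\tau) \, d\tau$. Let $L_g$ be its
Lipschitz coefficient. Thus:

$$
\begin{aligned}
|s(\nu_1) - s(\nu_2)| &= \left|\int_0^1 g'(\tau) \left(F^{-1}_{\nu_1}(\tau) - F^{-1}_{\nu_2}(\tau)\right) d\tau\right| \\
&\le \|g'\|_\infty \int_0^1 \left|F^{-1}_{\nu_1}(\tau) - F^{-1}_{\nu_2}(\tau)\right| d\tau \\
&\le L_g\, W_1(\nu_1, \nu_2).
\end{aligned}
$$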
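
The utility case of the lemma can be sanity-checked numerically. The sketch below is only an illustration: the choice of $f = \tanh$ (which is 1-Lipschitz), the random discrete distributions and the function names are assumptions made for the example. It records the gap $L\,W_1(\nu_1, \nu_2) - |s(\nu_1) - s(\nu_2)|$, which the lemma predicts to be non-negative.

```python
import random
from math import tanh

def cdf(dist, t):
    """Right-continuous CDF at t of a discrete distribution [(atom, prob), ...]."""
    return sum(p for z, p in dist if z <= t)

def w1(nu1, nu2):
    """1-Wasserstein distance via the area between the two CDFs."""
    grid = sorted({z for z, _ in nu1} | {z for z, _ in nu2})
    return sum(abs(cdf(nu1, left) - cdf(nu2, left)) * (right - left)
               for left, right in zip(grid, grid[1:]))

def utility(f, dist):
    return sum(p * f(z) for z, p in dist)

def random_dist(rng, k=5):
    """Random distribution with k atoms in [-3, 3] (illustrative choice)."""
    atoms = [rng.uniform(-3, 3) for _ in range(k)]
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [(z, w / total) for z, w in zip(atoms, weights)]

rng = random.Random(0)
lipschitz = 1.0  # tanh is 1-Lipschitz
gaps = []
for _ in range(100):
    nu1, nu2 = random_dist(rng), random_dist(rng)
    gaps.append(lipschitz * w1(nu1, nu2) - abs(utility(tanh, nu1) - utility(tanh, nu2)))
```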


D About the tightness of Theorem 1 


The upper bound provided by Theorem 1 is mainly based on Proposition 1. The latter is obtained by
summing, for every step, the projection bound by Rowland et al. [2019] (Eq. 6). Thus, achieving the
bound would first require finding a problem instance for which, at every step, the projection bound is
tight. It would then require verifying that the total error is the sum of those projection errors.


The experiment in Figure 2 already shows that the total error is very close to the sum of the
projection errors. However, in that example, the projection error bound is not reached after the first
step. In the following, we exhibit an MDP for which the projection bound is tight at every timestep.


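First, let us consider a family of distributions for which the projection error is tight:

Proposition 2. Let $N \in \mathbb{N}$, $\Delta \in \mathbb{R}^+$. Consider $z_i = \frac{i\Delta}{N}$. The distribution

$$\nu_{N,\Delta} = \frac{1}{2N} \sum_{i=0}^{N-1} \left(\delta_{z_i} + \delta_{z_{i+1}}\right)$$

has a support of length $\Delta$ and verifies $W_1(\nu_{N,\Delta}, \Pi_Q \nu_{N,\Delta}) = \frac{\Delta}{2N}$.

Proof. The cumulative distribution function (CDF) of the distribution $\nu_{N,\Delta}$ is

$$
F_{\nu_{N,\Delta}}(x) =
\begin{cases}
0 & x < 0 = z_0, \\
\frac{2i+1}{2N} = \tau_i & z_i \le x < z_{i+1}, \\
1 & x \ge \Delta = z_N.
\end{cases}
$$

We write $\tau_i = \frac{2i+1}{2N}$, so that $\forall i \in [|0, N-1|]$, $F_{\nu_{N,\Delta}}(z_i) = \tau_i$. We now make explicit the projection of $\nu_{N,\Delta}$
and the Wasserstein distance relative to it. A possible quantile projection is $\Pi_Q \nu_{N,\Delta} = \frac{1}{N} \sum_{i=0}^{N-1} \delta_{z_i}$,
and

$$
\begin{aligned}
W_1(\nu_{N,\Delta}, \Pi_Q \nu_{N,\Delta}) &= \sum_{i=0}^{N-1} \int_{i/N}^{(i+1)/N} \left|F^{-1}_{\nu_{N,\Delta}}(u) - F^{-1}_{\Pi_Q \nu_{N,\Delta}}(u)\right| du \\
&= \sum_{i=0}^{N-1} \left(\int_{i/N}^{\tau_i} |z_i - z_i| \, du + \int_{\tau_i}^{(i+1)/N} |z_{i+1} - z_i| \, du\right) \\
&= \frac{1}{2N} \sum_{i=0}^{N-1} (z_{i+1} - z_i) \\
&= \frac{\Delta}{2N}.
\end{aligned}
$$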


The value distribution of the following time steps is obtained by applying two operators: the Bellman
operator and the projection operator. Here we consider an MDP with only one state. We need to
find operators such that $\mathcal{T}\, \Pi_Q\, \nu_{N,\Delta_1} = \nu_{N,\Delta_2}$, where the Bellman operator simply consists in
adding the reward distribution.

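Proposition 3. Let $N \in \mathbb{N}$, $\Delta \in \mathbb{R}^+$. Consider $\varrho = \frac{1}{2}\left(\delta_0 + \delta_{\frac{\Delta}{N-1}}\right)$. Then there exists a quantile
projection operator $\Pi_Q$ such that

$$\varrho * \Pi_Q(\nu_{N,\Delta}) = \nu_{N, \Delta + \frac{\Delta}{N-1}}.$$

Proof. We consider $\Pi_Q(\nu_{N,\Delta}) = \frac{1}{N} \sum_{i=0}^{N-1} \delta_{\frac{i\Delta}{N-1}}$. Since $\forall i$, $\frac{i\Delta}{N} \le \frac{i\Delta}{N-1} \le \frac{(i+1)\Delta}{N}$, we have $F_{\nu_{N,\Delta}}\left(\frac{i\Delta}{N-1}\right) =
\tau_i$, verifying that it is indeed a valid quantile projection.

Hence,

$$
\begin{aligned}
\varrho * \Pi_Q(\nu_{N,\Delta}) &= \frac{1}{2}\left(\delta_0 + \delta_{\frac{\Delta}{N-1}}\right) * \left(\frac{1}{N} \sum_{i=0}^{N-1} \delta_{\frac{i\Delta}{N-1}}\right) \\
&= \frac{1}{2N} \sum_{i=0}^{N-1} \left(\delta_{\frac{i\Delta}{N-1}} + \delta_{\frac{(i+1)\Delta}{N-1}}\right) \\
&= \nu_{N, \Delta + \frac{\Delta}{N-1}}.
\end{aligned}
$$

The last equality comes down to noticing that $\frac{\Delta}{N-1} = \frac{1}{N}\left(\Delta + \frac{\Delta}{N-1}\right)$.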


Here, we take advantage of the fact that the quantile projection is not unique (see the discussion in
Section 2.1). By choosing a suitable projection, it is then possible to obtain one of those distributions
at every timestep.
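
The construction can be checked with exact rational arithmetic. The sketch below relies on one reading of the two propositions, which should be treated as an assumption: atoms $z_i = i\Delta/N$ for $\nu_{N,\Delta}$, projections with atoms $i\Delta/N$ and $i\Delta/(N-1)$, and a reward distribution $\frac{1}{2}(\delta_0 + \delta_{\Delta/(N-1)})$. All function names and the values of `N` and `delta` are illustrative.

```python
from fractions import Fraction

def nu(n, delta):
    """nu_{N,Delta} as {atom: mass}, with atoms z_i = i * Delta / N."""
    dist = {}
    for i in range(n):
        for z in (i * delta / n, (i + 1) * delta / n):
            dist[z] = dist.get(z, 0) + Fraction(1, 2 * n)
    return dist

def convolve(d1, d2):
    """Distribution of the sum of two independent discrete variables."""
    out = {}
    for z1, p1 in d1.items():
        for z2, p2 in d2.items():
            out[z1 + z2] = out.get(z1 + z2, 0) + p1 * p2
    return out

def w1(d1, d2):
    """1-Wasserstein distance via the area between the two CDFs."""
    grid = sorted(set(d1) | set(d2))
    total = 0
    for left, right in zip(grid, grid[1:]):
        f1 = sum(p for z, p in d1.items() if z <= left)
        f2 = sum(p for z, p in d2.items() if z <= left)
        total += abs(f1 - f2) * (right - left)
    return total

N, delta = 5, Fraction(3)
proj_left = {i * delta / N: Fraction(1, N) for i in range(N)}          # atoms z_i
proj_spread = {i * delta / (N - 1): Fraction(1, N) for i in range(N)}  # atoms i*Delta/(N-1)
rho = {Fraction(0): Fraction(1, 2), delta / (N - 1): Fraction(1, 2)}

error_left = w1(nu(N, delta), proj_left)
error_spread = w1(nu(N, delta), proj_spread)
next_dist = convolve(rho, proj_spread)
```

Both projections reach the error $\Delta/(2N)$ exactly, and adding the reward to the second one produces a distribution of the same family with a support enlarged by the factor $1 + \frac{1}{N-1}$.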


Corollary 1. Let $N \in \mathbb{N}$, $\Delta_0 \in \mathbb{R}^+$. Consider the sequence $\Delta_{h+1} = \Delta_h\left(1 + \frac{1}{N-1}\right)$ and $\varrho_h =
\frac{1}{2}\left(\delta_0 + \delta_{\frac{\Delta_h}{N-1}}\right)$. Consider $\Pi_Q$ as in Prop. 3.
Consider the MDP with only one state $x$ and action $a$, reward distribution $\varrho_h$, horizon $H$. Consider
$\hat{\eta}_h$ the value distributions obtained through dynamic programming with quantile projection. Then, at
any timestep $h$, the error induced by the projection operator matches the bound in Eq. 6:

$$W_1(\Pi_Q \eta_h, \eta_h) = \frac{\Delta_h}{2N}.$$

[Figure 3: Evaluation of the Wasserstein distance between the true value distribution and the approximated
one, in the MDP described in Corollary 1. Horizontal axis: timestep; vertical axis: Wasserstein distance;
curves: cumulative approximation error and true error.]


While the projection error is maximal in such an instance, it was verified experimentally that the bound in
Prop. 1 was still not tight. This comes from the fact that the non-expansion inequality for the Bellman operator
is not tight in this case, and that the triangle inequality used to sum the projection errors is not tight either.


We hence found that every inequality used in proving Theorem 1 can be tight, but there does not
seem to exist an instance for which all the inequalities are tight at the same time, meaning that this
bound would never be reached exactly.


E Proof of the main results 


The proof of the result is divided into parts. First we show that Bellman optimizable functionals verify
the two properties of Lemma 2 (Independence and Translation). Then, using those properties, we
prove that Bellman optimizable functionals can only be exponential utilities (Theorem 2). Using the
known fact that exponential utilities are Bellman optimizable, we obtain the full characterization.


E.1 Proof of Lemma 2 


Proof. To prove that each property is necessary, we use a proof by contradiction, and exhibit 
MDPs where the algorithm is not optimal when the property is not verified. 


Independence Property. Let $s$ be a Bellman optimizable statistical functional that does not satisfy
the Independence property. That is, there exist $\nu_1, \nu_2, \nu_3 \in \mathcal{P}(\mathbb{R})$ and $\lambda \in [0, 1]$ such that
$s(\nu_1) \ge s(\nu_2)$ but $s(\lambda\nu_1 + (1 - \lambda)\nu_3) < s(\lambda\nu_2 + (1 - \lambda)\nu_3)$.


Then consider the MDP in Fig. 4 (left) with horizon $H = 2$ corresponding to the depth of the tree:
the agent starts in Start and must take 2 actions, a unique but random and non-rewarding one ($a_0$)
and a final deterministic step ($a_1$ or $a_2$) to a rewarding state. Thus, by construction, the optimal
strategy is $(a_0, a_2)$, which leads to End 2 with probability $\lambda$ (and End 3 with probability $1 - \lambda$). The
true optimal distribution at the Start state is $\eta_0^* = \lambda\nu_2 + (1 - \lambda)\nu_3$. We compute the distributions
output by the algorithm:

h = 2: $\eta_2(\text{End 1}, a^*) = \delta_0$, $\eta_2(\text{End 2}, a^*) = \delta_0$, $\eta_2(\text{End 3}, a^*) = \delta_0$
h = 1: $\eta_1(\text{Left}, a^* = \arg\max_i s(\nu_i)) = \nu_1$, $\eta_1(\text{Right}, a^* = a_1, a_2) = \nu_3$
h = 0: $\eta_0(\text{Start}, a^* = a_0) = \lambda\nu_1 + (1 - \lambda)\nu_3$

Figure 4: Left: Independence Property Counter Example, Right: Translation Property Counter
Example. Each arrow represents a state transition, which is characterized by the action leading to the
transition, the probability of such transition, and the reward distribution of the transition.


The output return distribution $\eta_0$ is not the true optimal $\eta_0^*$ for $s$, so the algorithm is incorrect, which is
a contradiction as $s$ is assumed to be Bellman optimizable. Hence the property is needed.


Translation Property. Let $s$ be a Bellman optimizable statistical functional that does not verify
the Translation Property, i.e. there exist $\nu_1, \nu_2 \in \mathcal{P}(\mathbb{R})$, $c \in \mathbb{R}$ such that $s(\nu_1) \ge s(\nu_2)$ but
$s(\tau_c \nu_1) < s(\tau_c \nu_2)$. Then consider the MDP in Fig. 4 (right). The optimal strategy is again $(a_0, a_2)$ by
construction. The algorithm outputs the following distributions:

h = 2: $\eta_2(\text{Left}, a^*) = \delta_0$, $\eta_2(\text{Right}, a^*) = \delta_0$
h = 1: $\eta_1(\text{Step}, a^*) = \nu_1$
h = 0: $\eta_0(\text{Start}, a^*) = \tau_c \nu_1$

So here again, the algorithm does not output an optimal distribution for $s$, hence the necessity of the
property.

This proof shows that both properties are necessary, but not that they are sufficient. The other
implication could be proven, but the proof would be unnecessary as those properties are enough to
restrict to only one class of functions, which we already know to be Bellman optimizable.
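
The translation counterexample can be reproduced numerically. The sketch below is a hedged illustration: the quadratic utility, the two candidate distributions and the shift `c` are hypothetical choices, and the two functions only mimic the two-step MDP of the counterexample (a deterministic reward `c` followed by a choice between two reward distributions). A greedy backward step selects the best distribution at the intermediate state and then translates it, whereas the optimal value compares the translated distributions directly.

```python
from math import exp

def utility(f, dist):
    return sum(p * f(z) for z, p in dist)

def translate(dist, c):
    return [(z + c, p) for z, p in dist]

def greedy_dp_value(f, options, c):
    """Pick the option maximizing s at the intermediate state, then add the reward c."""
    best = max(options, key=lambda d: utility(f, d))
    return utility(f, translate(best, c))

def optimal_value(f, options, c):
    """Best value of s over the complete two-step strategies."""
    return max(utility(f, translate(d, c)) for d in options)

# Illustrative instance: nu1 is preferred before translation for the square utility,
# but nu2 becomes preferable once both are shifted by c.
nu1 = [(-2.0, 0.5), (2.0, 0.5)]
nu2 = [(1.9, 1.0)]
c = 10.0

def square(x):
    return x ** 2

def identity(x):
    return x

values = {f.__name__: (greedy_dp_value(f, [nu1, nu2], c), optimal_value(f, [nu1, nu2], c))
          for f in (square, identity, exp)}
```

For the mean and the exponential utility the greedy value coincides with the optimal one on this instance, while for the quadratic utility it is strictly smaller.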


E.2 Proof of Theorem 2 


Proof. For any return distribution $\nu$, let $s(\nu)$ be the considered functional. Since we assume that $s$
is $W_1$-continuous and Bellman optimizable, then by the Independence Property (Lemma 2) we know
from the Expected Utility Theorem that we can assume $s(\nu) = s_f(\nu) = \int_{\mathbb{R}} f(x) \, d\nu(x)$ for some
continuous, monotone mapping $f$. Without loss of generality, we may assume by a density
argument that $f$ is twice continuously differentiable.


By the Intermediate Value Theorem, we can then define $\phi(h) = f^{-1}\left(\frac{1}{2}(f(0) + f(h))\right)$, so that
$\frac{1}{2}(f(0) + f(h)) = f(\phi(h))$ and in particular $\phi(0) = f^{-1}(f(0)) = 0$.

A very special case is when $f$ is constant, which satisfies the theorem. We now assume that there
exists a point $x_0$ such that $f'$ does not vanish on a neighborhood of $x_0$. On this neighborhood, using
the inverse function theorem, $\phi$ is also twice differentiable. Without loss of generality, we assume
that $x_0 = 0$.

For any fixed $h > 0$, we consider the probability distributions $\nu_1 = \frac{1}{2}(\delta_0 + \delta_h)$ and $\nu_2 = \delta_{\phi(h)}$.
Remark that $s_f(\nu_1) = \int f \, d\nu_1 = \frac{1}{2}(f(0) + f(h))$ and $s_f(\nu_2) = f(\phi(h)) = \frac{1}{2}(f(0) + f(h))$ by
definition of $\phi$, so $s_f(\nu_1) = s_f(\nu_2)$.

The Translation property implies that for all $x \in \mathbb{R}$, $s_f(\nu_1) \le s_f(\nu_2) \Rightarrow s_f(\nu_1(\cdot + x)) \le
s_f(\nu_2(\cdot + x))$ and $s_f(\nu_1) \ge s_f(\nu_2) \Rightarrow s_f(\nu_1(\cdot + x)) \ge s_f(\nu_2(\cdot + x))$. Hence, $\forall x \in \mathbb{R}$,

$$\frac{1}{2}\big(f(x) + f(x + h)\big) = f(x + \phi(h)).$$
This equation can be differentiated twice with respect to $h$. For any value of $x$, we obtain:

$$\frac{1}{2} f'(x + h) = \phi'(h) f'(x + \phi(h)) \quad \text{and} \quad (18)$$

$$\frac{1}{2} f''(x + h) = \phi''(h) f'(x + \phi(h)) + \phi'(h)^2 f''(x + \phi(h)). \quad (19)$$

Recall that by definition, $\phi(0) = 0$. Now, for $x = 0$ and $h = 0$, Eq. (18) yields $\frac{1}{2} f'(0) = \phi'(0) f'(\phi(0)) =
\phi'(0) f'(0)$ and, since $f'(0) \ne 0$, $\phi'(0) = \frac{1}{2}$.
Now, choosing $h = 0$ in (19) and plugging in the values of $\phi(0)$ and $\phi'(0)$, we obtain for all $x \in \mathbb{R}$:

$$\frac{1}{4} f''(x) = \phi''(0) f'(x).$$


We then consider two cases, depending on whether $\phi''(0)$ is null or not.


Case 1: $\phi''(0) = 0$. The equation simply becomes $f''(x) = 0$, hence $f$ is affine: $\exists a, b \in \mathbb{R}$, $f(x) =
ax + b$.


Case 2: $\phi''(0) \ne 0$. We write $\beta = 4\phi''(0)$. The differential equation becomes $f''(x) = \beta f'(x)$,
whose solutions are of the form

$$\exists c_1, c_2 \in \mathbb{R}, \quad f(x) = c_1 \exp(\beta x) + c_2.$$


Hence, $f$ can only be the identity or the exponential, up to an affine transformation.
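
The functional equation $\frac{1}{2}(f(x) + f(x + h)) = f(x + \phi(h))$ used in the proof can be explored numerically. The sketch below is an illustration under stated assumptions: the three test functions, the bisection-based inverse and the sample points are hypothetical choices. It evaluates the largest violation of the equation over a few values of $x$ for a fixed $h$.

```python
from math import exp

def inverse(f, y, lo=-50.0, hi=50.0):
    """Inverse of an increasing function f on [lo, hi], by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def max_residual(f, h, xs):
    """Largest violation of (f(x) + f(x + h)) / 2 = f(x + phi(h)) over xs."""
    phi_h = inverse(f, 0.5 * (f(0.0) + f(h)))  # phi(h) as defined in the proof
    return max(abs(0.5 * (f(x) + f(x + h)) - f(x + phi_h)) for x in xs)

def affine(x):
    return 3.0 * x - 1.0

def exponential(x):
    return 2.0 * exp(0.7 * x) + 1.0

def cubic(x):
    # Increasing, but neither affine nor exponential.
    return x + x ** 3

xs = [-1.0, 0.5, 1.0, 2.0]
residuals = {f.__name__: max_residual(f, 1.0, xs) for f in (affine, exponential, cubic)}
```

The residual vanishes up to numerical precision for the affine and exponential functions, and is clearly positive for the cubic one, in line with the two cases of the proof.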